Preparations

Load the necessary libraries

library(car)       #for regression diagnostics
library(broom)     #for tidy output
library(ggfortify) #for model diagnostics
library(sjPlot)    #for outputs
library(knitr)     #for kable
library(effects)   #for partial effects plots
library(emmeans)   #for estimating marginal means
library(ggeffects) #for plotting marginal means
library(MASS)      #for glm.nb
library(MuMIn)     #for AICc
library(tidyverse) #for data wrangling
library(modelr)    #for auxillary modelling functions
library(DHARMa)    #for residual diagnostics plots
library(performance) #for residuals diagnostics
library(see)         #for plotting residuals
library(patchwork) #grid of plots
library(scales)    #for more scales
theme_set(theme_light())

Scenario

An ecologist studying a rocky shore at Phillip Island, in southeastern Australia, was interested in how clumps of intertidal mussels are maintained (Quinn 1988). In particular, he wanted to know how densities of adult mussels affected recruitment of young individuals from the plankton. As with most marine invertebrates, recruitment is highly patchy in time, so he expected to find seasonal variation, and the interaction between season and density - whether effects of adult mussel density vary across seasons - was the aspect of most interest.

The data were collected from four seasons, and with two densities of adult mussels. The experiment consisted of clumps of adult mussels attached to the rocks. These clumps were then brought back to the laboratory, and the number of baby mussels recorded. There were 3-6 replicate clumps for each density and season combination.

Format of quinn.csv data files

season density recruits sqrtrecruits group
Spring Low 15 3.87 SpringLow
.. .. .. .. ..
Spring High 11 3.32 SpringHigh
.. .. .. .. ..
Summer Low 21 4.58 SummerLow
.. .. .. .. ..
Summer High 34 5.83 SummerHigh
.. .. .. .. ..
Autumn Low 14 3.74 AutumnLow
.. .. .. .. ..
season Categorical listing of season in which mussel clumps were collected - independent variable
density Categorical listing of the density of mussels within mussel clump - independent variable
recruits The number of mussel recruits - response variable
sqrtrecruits Square root transformation of recruits - needed to meet the test assumptions
groups Categorical listing of season/density combinations - used for checking ANOVA assumptions

Mussel

Read in the data

quinn = read_csv('../data/quinn.csv', trim_ws=TRUE) %>%
  janitor::clean_names()
glimpse(quinn)
## Rows: 42
## Columns: 5
## $ season       <chr> "Spring", "Spring", "Spring", "Spring", "Spring", "Sprin…
## $ density      <chr> "Low", "Low", "Low", "Low", "Low", "High", "High", "High…
## $ recruits     <dbl> 15, 10, 13, 13, 5, 11, 10, 15, 10, 13, 1, 21, 31, 21, 18…
## $ sqrtrecruits <dbl> 3.872983, 3.162278, 3.605551, 3.605551, 2.236068, 3.3166…
## $ group        <chr> "SpringLow", "SpringLow", "SpringLow", "SpringLow", "Spr…
summary(quinn)
##     season            density             recruits      sqrtrecruits  
##  Length:42          Length:42          Min.   : 0.00   Min.   :0.000  
##  Class :character   Class :character   1st Qu.: 9.25   1st Qu.:3.041  
##  Mode  :character   Mode  :character   Median :13.50   Median :3.674  
##                                        Mean   :18.33   Mean   :3.871  
##                                        3rd Qu.:21.75   3rd Qu.:4.663  
##                                        Max.   :69.00   Max.   :8.307  
##     group          
##  Length:42         
##  Class :character  
##  Mode  :character  
##                    
##                    
## 

Since we intend to model both season and density as categorical variables, we need to explicitly declair them as factors.

quinn <- quinn %>% janitor::clean_names() %>%
  mutate(season = factor(season, 
                         levels=c('Spring', 'Summer', 
                                  'Autumn', 'Winter')),
         density = factor(density))

Exploratory data analysis

Model formula: \[ \begin{align} y_i &\sim{} \mathcal{Pois}(\lambda_i)\\ ln(\mu_i) &= \boldsymbol{\beta} \bf{X_i}\\[1em] \end{align} \]

where \(\boldsymbol{\beta}\) is a vector of effects parameters and \(\bf{X}\) is a model matrix representing the intercept and effects of season, density and their interaction on mussel recruitment.

ggplot(quinn, aes(y=recruits, x=season, fill=density)) +
     geom_boxplot()

Conclusions:

  • there is clear evidence of non-homogeneity of variance
  • specifically, there is evidence that the variance is related to the mean in that boxplots that are lower on the y-axis (low mean) also have lower variance (shorter boxplots)
  • this might be expected for count data and we might consider that a Poisson distribution (which assumes that mean and variance are equal - and thus related in a very specific way).

Lets mimic the effect of using a log link, by using log scaled y-axis.

ggplot(quinn, aes(y=recruits, x=season, fill=density)) +
  geom_boxplot() +
  scale_y_log10()

Conclusions:

  • that is an improvement

Fit the model

quinn.glmG <- glm(log(recruits + 1) ~ density*season, data = quinn, family = gaussian)
quinn.glm <- glm(recruits ~ density*season, data = quinn, family = poisson(link = 'log'))

Model validation

autoplot(quinn.glm, which = 1:6)

Cook’s D is off, so although residuals and QQ are fine, there’s something wrong with the model

quinn.resid <- simulateResiduals(quinn.glm, plot = TRUE)

Problem with deviance = More variance than we should expect Model is overdispersed so the model is not reliable: potential because this si real data and there are more variables that explain recruitment, not just our variables. We can add an ‘unit level random effect’ to our model = add a variable that makes each variable unique and ‘soaks’ some of the variance. It will take one DF and shrinks the other estimates, making them less obvious. Maybe some of the zeros are not actually zeros because the recruits were to small to be counted (zero inflated model).

Can also test for dispersion:

testDispersion(quinn.resid) #red line is your supposed actual disperision, the histogram is what it actually should be

## 
##  DHARMa nonparametric dispersion test via sd of residuals fitted vs.
##  simulated
## 
## data:  simulationOutput
## dispersion = 3.1726, p-value < 2.2e-16
## alternative hypothesis: two.sided
performance::check_overdispersion(quinn.glm) #3.309 = 3x more variable that there would normally be
## # Overdispersion test
## 
##        dispersion ratio =   3.309
##   Pearson's Chi-Squared = 112.497
##                 p-value = < 0.001

Diagnosis = overdispersersion in the model (3x)

testZeroInflation(quinn.resid)

## 
##  DHARMa zero-inflation test via comparison to expected zeros with
##  simulation under H0 = fitted model
## 
## data:  simulationOutput
## ratioObsSim = 7.6923, p-value = 0.04
## alternative hypothesis: two.sided
performance::check_zeroinflation(quinn.glm) #shows only 2 extra zeros than expected
## # Check for zero-inflation
## 
##    Observed zeros: 2
##   Predicted zeros: 0
##             Ratio: 0.00

Checking for overdispersion and zero inflation together actually provides a better picture. It showed that the model is overdispersed but it then showed that it’s not really zero inflated as 2 zeros are not much.

Different model

We’ll fit a negative binomial by dividing data into zeros and ones.

quinn.nb <- MASS::glm.nb(recruits ~ density*season, data = quinn)
quinn.resid <- simulateResiduals(quinn.nb, plot = TRUE)

Compare:

AICc(quinn.glm, quinn.nb)

Parial plots

plot_model(quinn.nb, type = 'eff', terms = c('season', 'density'))

Model investigation / hypothesis testing

summary(quinn.nb) #it's on a log-scale
## 
## Call:
## MASS::glm.nb(formula = recruits ~ density * season, data = quinn, 
##     init.theta = 9.022467857, link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.8704  -0.8274   0.0000   0.4999   2.1866  
## 
## Coefficients:
##                         Estimate Std. Error z value Pr(>|z|)    
## (Intercept)               2.3026     0.1875  12.283  < 2e-16 ***
## densityLow                0.1133     0.2742   0.413  0.67934    
## seasonSummer              1.5721     0.2389   6.581 4.69e-11 ***
## seasonAutumn              0.6763     0.2492   2.714  0.00664 ** 
## seasonWinter             -0.5680     0.2881  -1.971  0.04870 *  
## densityLow:seasonSummer  -0.8970     0.3509  -2.556  0.01059 *  
## densityLow:seasonAutumn  -0.1881     0.3788  -0.496  0.61955    
## densityLow:seasonWinter  -0.8671     0.5338  -1.624  0.10432    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(9.0225) family taken to be 1)
## 
##     Null deviance: 183.269  on 41  degrees of freedom
## Residual deviance:  54.883  on 34  degrees of freedom
## AIC: 293.09
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  9.02 
##           Std. Err.:  3.69 
## 
##  2 x log-likelihood:  -275.095
tidy(quinn.nb, conf.int = TRUE, exponentiate = TRUE)

N.B. if the model isn’t balanced, an ANOVA table will change the outputs depending on the order of the variables

Can use emmeans to do a pairwise to see the effect of each season

emmeans(quinn.nb, pairwise ~ density|season, type = 'response')
## $emmeans
## season = Spring:
##  density response   SE  df asymp.LCL asymp.UCL
##  High       10.00 1.87 Inf      6.93     14.44
##  Low        11.20 2.24 Inf      7.57     16.58
## 
## season = Summer:
##  density response   SE  df asymp.LCL asymp.UCL
##  High       48.17 7.13 Inf     36.03     64.39
##  Low        22.00 3.55 Inf     16.03     30.19
## 
## season = Autumn:
##  density response   SE  df asymp.LCL asymp.UCL
##  High       19.67 3.23 Inf     14.26     27.13
##  Low        18.25 3.71 Inf     12.25     27.19
## 
## season = Winter:
##  density response   SE  df asymp.LCL asymp.UCL
##  High        5.67 1.24 Inf      3.69      8.70
##  Low         2.67 1.07 Inf      1.21      5.87
## 
## Confidence level used: 0.95 
## Intervals are back-transformed from the log scale 
## 
## $contrasts
## season = Spring:
##  contrast   ratio    SE  df z.ratio p.value
##  High / Low 0.893 0.245 Inf -0.413  0.6793 
## 
## season = Summer:
##  contrast   ratio    SE  df z.ratio p.value
##  High / Low 2.189 0.480 Inf  3.577  0.0003 
## 
## season = Autumn:
##  contrast   ratio    SE  df z.ratio p.value
##  High / Low 1.078 0.282 Inf  0.286  0.7749 
## 
## season = Winter:
##  contrast   ratio    SE  df z.ratio p.value
##  High / Low 2.125 0.973 Inf  1.646  0.0999 
## 
## Tests are performed on the log scale
#on an absolute scale instead of fractional:
emmeans(quinn.nb, ~density|season) %>%
  regrid() %>%
  pairs()
## season = Spring:
##  contrast   estimate   SE  df z.ratio p.value
##  High - Low    -1.20 2.92 Inf -0.411  0.6812 
## 
## season = Summer:
##  contrast   estimate   SE  df z.ratio p.value
##  High - Low    26.17 7.97 Inf  3.284  0.0010 
## 
## season = Autumn:
##  contrast   estimate   SE  df z.ratio p.value
##  High - Low     1.42 4.92 Inf  0.288  0.7734 
## 
## season = Winter:
##  contrast   estimate   SE  df z.ratio p.value
##  High - Low     3.00 1.64 Inf  1.829  0.0673
newdata <- emmeans(quinn.nb, ~density|season, type = 'response') %>% 
  as.data.frame()

ggplot(data = newdata, mapping = aes(y = response, x = season, fill = density)) + 
  geom_pointrange(aes(ymin = asymp.LCL, ymax = asymp.UCL, 
                      shape = density),
                  position = position_dodge(width = 0.1)) + #so that they don't overlap
  theme_classic() +
  theme(axis.title.x = element_blank(),
        legend.position = c(0.01, 1), legend.justification =  c(0, 1)) +
  scale_shape_manual(values = c(21, 22))

Predictions

Summary figures

Fitting a zero inflated model:

library(pscl)
quinn.zip <- zeroinfl(recruits ~ density*season | 1, data=quinn,  dist='poisson') #the "|1" fits basically a null model
#quinn.resid <- simulateResiduals(quinn.zip,  plot=TRUE) does not support zero inflated models
summary(quinn.zip)
## 
## Call:
## zeroinfl(formula = recruits ~ density * season | 1, data = quinn, dist = "poisson")
## 
## Pearson residuals:
##     Min      1Q  Median      3Q     Max 
## -2.4510 -0.8645  0.1262  0.7225  2.8711 
## 
## Count model coefficients (poisson with log link):
##                         Estimate Std. Error z value Pr(>|z|)    
## (Intercept)               2.3026     0.1291  17.836  < 2e-16 ***
## densityLow                0.1133     0.1858   0.610  0.54191    
## seasonSummer              1.5721     0.1419  11.081  < 2e-16 ***
## seasonAutumn              0.6763     0.1586   4.266 1.99e-05 ***
## seasonWinter             -0.5680     0.2147  -2.646  0.00815 ** 
## densityLow:seasonSummer  -0.8970     0.2134  -4.202 2.64e-05 ***
## densityLow:seasonAutumn  -0.1881     0.2381  -0.790  0.42957    
## densityLow:seasonWinter   0.2165     0.4534   0.478  0.63300    
## 
## Zero-inflation model coefficients (binomial with logit link):
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -3.0037     0.7305  -4.112 3.92e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Number of iterations in BFGS optimization: 13 
## Log-likelihood: -151.3 on 9 Df
plogis(-3) #approximately 5% of zeros are not real zeros (-3 comes from the zero inflation model coefficients estimate), 5% is a small possibility and it makes sense according to our previous analyses
## [1] 0.04742587
#tidy(quinn.zip,  conf.int=TRUE, exponentiate = TRUE)
exp(-3.0037)
## [1] 0.0496032
quinn.zip1 <- zeroinfl(recruits ~ density*season | season, data=quinn,  dist='poisson') #was the rate any different in different seasons? ('|season')
summary(quinn.zip1)
## 
## Call:
## zeroinfl(formula = recruits ~ density * season | season, data = quinn, 
##     dist = "poisson")
## 
## Pearson residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5327 -1.0591  0.1537  0.9216  3.6831 
## 
## Count model coefficients (poisson with log link):
##                         Estimate Std. Error z value Pr(>|z|)    
## (Intercept)               2.3026     0.1291  17.836  < 2e-16 ***
## densityLow                0.1133     0.1858   0.610  0.54191    
## seasonSummer              1.5721     0.1419  11.081  < 2e-16 ***
## seasonAutumn              0.6763     0.1586   4.266 1.99e-05 ***
## seasonWinter             -0.5680     0.2147  -2.646  0.00815 ** 
## densityLow:seasonSummer  -0.8970     0.2134  -4.202 2.64e-05 ***
## densityLow:seasonAutumn  -0.1881     0.2381  -0.790  0.42958    
## densityLow:seasonWinter   0.2291     0.4375   0.524  0.60045    
## 
## Zero-inflation model coefficients (binomial with logit link):
##                Estimate Std. Error z value Pr(>|z|)
## (Intercept)  -2.157e+01  1.454e+04  -0.001    0.999
## seasonSummer -2.543e-08  2.013e+04   0.000    1.000
## seasonAutumn -2.119e-08  2.107e+04   0.000    1.000
## seasonWinter  2.031e+01  1.454e+04   0.001    0.999
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Number of iterations in BFGS optimization: 13 
## Log-likelihood:  -148 on 12 Df
exp(-3.0037)
## [1] 0.0496032
quinn.zinb <- zeroinfl(recruits ~ density*season | 1, data=quinn,  dist='negbin')
AICc(quinn.zip,  quinn.zinb)
summary(quinn.zinb)
## 
## Call:
## zeroinfl(formula = recruits ~ density * season | 1, data = quinn, dist = "negbin")
## 
## Pearson residuals:
##        Min         1Q     Median         3Q        Max 
## -1.981e+00 -7.511e-01 -1.522e-05  5.289e-01  2.869e+00 
## 
## Count model coefficients (negbin with log link):
##                         Estimate Std. Error z value Pr(>|z|)    
## (Intercept)               2.3026     0.1875  12.283  < 2e-16 ***
## densityLow                0.1133     0.2742   0.413  0.67933    
## seasonSummer              1.5721     0.2389   6.580 4.69e-11 ***
## seasonAutumn              0.6763     0.2492   2.714  0.00664 ** 
## seasonWinter             -0.5680     0.2881  -1.971  0.04869 *  
## densityLow:seasonSummer  -0.8969     0.3509  -2.556  0.01059 *  
## densityLow:seasonAutumn  -0.1881     0.3788  -0.496  0.61957    
## densityLow:seasonWinter  -0.8671     0.5338  -1.624  0.10432    
## Log(theta)                2.1997     0.4091   5.378 7.55e-08 ***
## 
## Zero-inflation model coefficients (binomial with logit link):
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)   -15.29     453.38  -0.034    0.973
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Theta = 9.0223 
## Number of iterations in BFGS optimization: 43 
## Log-likelihood: -137.5 on 10 Df
exp(-15.29) #none of them are false zeros because the negative binomial already handled the small amount of zeros
## [1] 2.288956e-07

References

Quinn, G. P. 1988. “Ecology of the Intertidal Pulmonate Limpet Siphonaria Diemenensis Quoy Et Gaimard. II Reproductive Patterns and Energetics.” Journalofexperimentalmarinebiologyandecology 117: 137–56.